Semantic similarity

Semantic similarity, or semantic relatedness, is a concept whereby a set of documents, or terms within term lists, is assigned a metric based on the likeness of their meaning or semantic content.

Concretely, this can be achieved, for instance, by defining a topological similarity, which uses ontologies to define a distance between words, or by statistical means, such as a vector space model that correlates words and textual contexts from a suitable text corpus (co-occurrence). A naive topological metric for terms arranged as nodes in a directed acyclic graph, such as a hierarchy, is the minimal distance, counted in separating edges, between the two term nodes.
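The naive edge-counting metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy taxonomy `PARENTS` and the function names are hypothetical, and edges are traversed in both directions so that two terms can be connected through a shared ancestor.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> list of parents.
# A directed acyclic graph allows a node to have several parents.
PARENTS = {
    "cat": ["feline"],
    "lion": ["feline"],
    "dog": ["canine"],
    "feline": ["carnivore"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": [],
}


def build_neighbors(parents):
    """Turn the child -> parents map into an undirected adjacency map."""
    neighbors = {node: set() for node in parents}
    for child, node_parents in parents.items():
        for parent in node_parents:
            neighbors.setdefault(parent, set()).add(child)
            neighbors[child].add(parent)
    return neighbors


def edge_distance(parents, a, b):
    """Return the minimal number of edges separating a and b.

    Returns None if either term is unknown or the terms are unconnected.
    """
    neighbors = build_neighbors(parents)
    if a not in neighbors or b not in neighbors:
        return None
    queue = deque([(a, 0)])
    seen = {a}
    # Breadth-first search visits nodes in order of increasing distance,
    # so the first time b is reached the distance is minimal.
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In this toy hierarchy, `edge_distance(PARENTS, "cat", "lion")` is 2 (through "feline"), while `edge_distance(PARENTS, "cat", "dog")` is 4 (through "carnivore"), so "cat" counts as closer to "lion" than to "dog". Smaller distances correspond to higher similarity.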

Taxonomy

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not.[1] However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all ask, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity or relatedness and 0 signifies little to none.
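As an illustration of how such a score can arise, the sketch below builds co-occurrence vectors from a tiny corpus and compares them with cosine similarity. Cosine similarity is one common choice, not the only one, and everything named here (`CORPUS`, `cooccurrence_vector`, `cosine`, `to_unit_interval`, the window size) is a hypothetical assumption made for the example. Cosine similarity lies between -1 and 1 in general; for count vectors, which have no negative entries, it lies between 0 and 1.

```python
import math
from collections import Counter

# Hypothetical three-sentence corpus, used only for illustration.
CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]


def cooccurrence_vector(target, sentences, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    sq_u = sum(x * x for x in u.values())
    sq_v = sum(x * x for x in v.values())
    if sq_u == 0 or sq_v == 0:
        return 0.0  # assumption: an empty vector is treated as unrelated
    return dot / math.sqrt(sq_u * sq_v)


def to_unit_interval(score):
    """Linearly rescale a score from [-1, 1] to [0, 1]."""
    return (score + 1) / 2
```

With this corpus, "cat" and "dog" share two context words ("the", "drinks"), while "cat" and "car" share only one ("the"), so `cosine` ranks the first pair as more closely related. The `to_unit_interval` helper shows one simple way to map a score on the -1 to 1 scale onto the 0 to 1 scale.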

Visualisation

An intuitive way of visualising the semantic similarity of terms is to group closely related terms together and to space more distantly related ones further apart. This is also common, if sometimes subconscious, practice in mind maps and concept maps.

Applications

Biomedical Informatics

Semantic similarity measures have been applied and developed in biomedical ontologies,[2][3] namely the Gene Ontology (GO). They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds[4] and diseases.[5]

These comparisons can be done using tools freely available on the web:

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:

Linguistics

Several metrics use WordNet. Its advantage is that it is manually constructed; its disadvantages are that it is manually constructed (not automatically learned), that it cannot measure relatedness between multi-word terms, and that its vocabulary is non-incremental.

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

Other measures calculate the similarity between ontological instances:

Some examples:

Edge-based

Node-based

Pairwise

Groupwise

Statistical similarity

Software

Web Services

See also

Notes

  1. ^ Budanitsky, Alexander; Hirst, Graeme (2001). "Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures". Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics. Pittsburgh.
  2. ^ Pesquita, Catia; Faria, Daniel; Falcão, André O.; Lord, Phillip; Couto, Francisco M. (2009). Bourne, Philip E., ed. "Semantic Similarity in Biomedical Ontologies". PLoS Computational Biology 5 (7): e1000443. doi:10.1371/journal.pcbi.1000443. PMC 2712090. PMID 19649320. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2712090.
  3. ^ Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". BMC Bioinformatics 11: 588. doi:10.1186/1471-2105-11-588. PMID 21122125.
  4. ^ Ferreira, João D.; Couto, Francisco M. (2010). Mitchell, John B. O., ed. "Semantic Similarity for Automatic Classification of Chemical Compounds". PLoS Computational Biology 6 (9): e1000937. doi:10.1371/journal.pcbi.1000937. PMC 2944781. PMID 20885779. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2944781.
  5. ^ Köhler, S; Schulz, MH; Krawitz, P; Bauer, S; Dolken, S; Ott, CE; Mundlos, C; Horn, D et al. (2009). "Clinical diagnostics in human genetics with semantic similarity searches in ontologies". American Journal of Human Genetics 85 (4): 457–64. doi:10.1016/j.ajhg.2009.09.003. PMC 2756558. PMID 19800049. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2756558.
  6. ^ Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (IJCAI'95), Chris S. Mellish (Ed.), Vol. 1. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 448-453
  7. ^ Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296-304
  8. ^ J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In International Conference Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997
  9. ^ Couto, F. & Silva, M. (2011), Disjunctive Shared Information between Ontology Concepts: application to Gene Ontology. Journal of Biomedical Semantics, 2:5
  10. ^ Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152
  11. ^ Catia Pesquita, Daniel Faria, Hugo Bastos, António Ferreira, Andre O Falcao, Francisco Couto 2008: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics Suppl 5(9), S4
  12. ^ "Google Similarity Distance".

References

External links